
Feat/hybrid search#16

Open
dorodb-web22 wants to merge 22 commits into AOSSIE-Org:main from dorodb-web22:feat/hybrid-search

Conversation


@dorodb-web22 dorodb-web22 commented Mar 7, 2026

Description

This PR introduces Hybrid Search to the AI module, building on the recent embedding pipeline: it integrates keyword search and combines the results using Reciprocal Rank Fusion (RRF). This keeps searches robust for exact keyword matches while still benefiting from semantic understanding.

Key additions:

  • Added a pure-TypeScript BM25 KeywordSearchEngine with standard English stop-word filtering and no external dependencies.
  • Created HybridSearchService that orchestrates semantic and keyword searches in parallel.
  • Implemented standard Reciprocal Rank Fusion (k=60) for mathematically sound, weight-free ranking.
  • Added 26 behavior-driven unit tests covering both the BM25 logic and the RRF integration.

All 77 AI-related tests are passing locally and TypeScript compilation is clean. I'm open to any feedback on the implementation or tuning parameters!
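For context, the RRF combination described above can be sketched in a few lines of TypeScript. This is a hypothetical standalone sketch with illustrative names, not the PR's actual `reciprocalRankFusion`:

```typescript
// Hypothetical sketch of Reciprocal Rank Fusion (k = 60).
// Names and shapes are illustrative, not the PR's actual API.
type RankedItem = { id: string; score: number };

function reciprocalRankFusion(resultLists: RankedItem[][], k = 60): RankedItem[] {
    const fused = new Map<string, number>();
    for (const list of resultLists) {
        list.forEach((item, rank) => {
            // Each list contributes 1 / (k + rank); with a 0-based rank,
            // the top hit of every list adds 1 / (k + 1).
            fused.set(item.id, (fused.get(item.id) ?? 0) + 1 / (k + rank + 1));
        });
    }
    return [...fused.entries()]
        .map(([id, score]) => ({ id, score }))
        .sort((a, b) => b.score - a.score);
}
```

Because only ranks (not raw scores) feed the sum, no per-engine weights or score normalization are needed, which is what makes the fusion weight-free.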

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact.

Summary by CodeRabbit

  • New Features

    • Hybrid search: BM25 keyword engine + semantic ranking with reciprocal rank fusion
    • On-device embeddings service with configurable model/limits and chunk-aware embedding
    • In-memory vector store with cosine-similarity search, persistence, and related-note discovery
    • Centralized AI exports, shared types, default configs, and paragraph chunker
  • Tests

    • Extensive unit & integration tests covering embeddings, vector store, semantic, keyword, and hybrid search
  • Chores

    • Project init: package manifest, TypeScript build configs, test runner config, and .gitignore entries

@coderabbitai

coderabbitai bot commented Mar 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a TypeScript AI search stack: embedding service, vector store with persistence, semantic and BM25 keyword search engines, a hybrid search coordinator with reciprocal rank fusion, shared types and barrel exports, project configs, and extensive unit tests.

Changes

Cohort / File(s) — Summary

  • Project config (/.gitignore, package.json, tsconfig.json, tsconfig.build.json, vitest.config.ts)
    Adds VCS ignores, package manifest, TypeScript build config, and Vitest test/coverage settings.
  • Shared types & barrel (src/ai/types.ts, src/ai/index.ts)
    Defines core data types, defaults, and a central re-export barrel for AI modules.
  • Embeddings (src/ai/embeddings.ts)
    New EmbeddingService and EmbedFn type with lazy transformers import, injectable embed function, truncation heuristic, chunk handling, and init lifecycle.
  • Vector store (src/ai/vectorStore.ts)
    New in-memory VectorStore with cosine similarity, dimension validation, JSON persistence (version/model checks), search/findRelated, and note-scoped operations.
  • Semantic search (src/ai/semanticSearch.ts)
    New SemanticSearchService, ChunkFn/defaultChunker, indexing/replacement semantics, embedding integration, search, related-note finding, and persistence.
  • Keyword search (src/ai/keywordSearch.ts)
    New BM25-based tokenize and KeywordSearchEngine: inverted index, term frequencies, BM25 scoring, index/remove APIs, and stop-word filtering.
  • Hybrid search & fusion (src/ai/hybridSearch.ts)
    Adds reciprocalRankFusion and HybridSearchService that combines semantic and keyword results with configurable weights, filtering, and topK fusion.
  • Tests (src/ai/__tests__/*.test.ts)
    Adds extensive unit tests for embeddings, vector store, semantic, keyword, and hybrid search (mocks, persistence, edge cases, ranking/fusion).
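For reference, the per-term core of the BM25 scoring used by the keyword engine can be sketched as follows. The formula is the standard Okapi BM25 term score; parameter names are illustrative and this is not the PR's code:

```typescript
// Illustrative BM25 score for a single query term in one document.
// k1 ≈ 1.2 controls term-frequency saturation; b ≈ 0.75 controls
// document-length normalization. Not the PR's implementation.
function bm25Term(
    tf: number,        // term frequency in this document
    df: number,        // number of documents containing the term
    numDocs: number,   // total documents in the index
    docLen: number,    // length of this document in tokens
    avgDocLen: number, // average document length across the index
    k1 = 1.2,
    b = 0.75,
): number {
    // Rarer terms get a higher inverse-document-frequency weight.
    const idf = Math.log(1 + (numDocs - df + 0.5) / (df + 0.5));
    // Saturating term-frequency component with length normalization.
    const norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));
    return idf * ((tf * (k1 + 1)) / norm);
}
```

A document's BM25 score for a query is the sum of this term score over the (stop-word-filtered) query terms.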

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Hybrid as HybridSearchService
    participant Semantic as SemanticSearchService
    participant Keyword as KeywordSearchEngine
    participant Embedding as EmbeddingService
    participant Store as VectorStore
    participant RRF as RRF

    User->>Hybrid: search(query, options)
    par semantic path
        Hybrid->>Semantic: search(query, options)
        Semantic->>Embedding: embed(query)
        Embedding-->>Semantic: queryVector
        Semantic->>Store: search(queryVector, topK, minScore)
        Store-->>Semantic: semanticResults
        Semantic-->>Hybrid: semanticResults
    and keyword path
        Hybrid->>Keyword: search(query, topK)
        Keyword-->>Hybrid: keywordResults
    end
    Hybrid->>RRF: reciprocalRankFusion([semanticResults, keywordResults], k)
    RRF-->>Hybrid: fusedResults
    Hybrid-->>User: topK fusedResults
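The parallel fan-out in the diagram can be sketched roughly as follows; the interfaces are hypothetical stand-ins for the PR's services, not its real API:

```typescript
// Hypothetical sketch of the search fan-out shown in the diagram.
interface Hit { id: string; score: number }

async function hybridSearch(
    query: string,
    semantic: { search(q: string): Promise<Hit[]> },
    keyword: { search(q: string): Hit[] },
    fuse: (lists: Hit[][]) => Hit[],
    topK = 10,
): Promise<Hit[]> {
    // Run both engines concurrently; the keyword path is synchronous,
    // but Promise.all keeps the two branches symmetric.
    const [semanticHits, keywordHits] = await Promise.all([
        semantic.search(query),
        Promise.resolve(keyword.search(query)),
    ]);
    // Fuse the two ranked lists, then truncate to the caller's topK.
    return fuse([semanticHits, keywordHits]).slice(0, topK);
}
```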

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Typescript Lang

Poem

🐰 I nibble tokens, hop through streams of text,
Vectors hum and headings stitch the rest,
Keywords and meaning tumble, blend, and play,
Fusion hops in, ranking bright as day,
Hooray — search gardens bloom and dance away! 🌿

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check — ❓ Inconclusive
    The title "Feat/hybrid search" is vague and uses a non-descriptive format (prefix/feature-name) rather than a clear summary. While it hints at the hybrid search feature, it lacks specificity about what was accomplished and doesn't clearly convey the main change to someone scanning commit history.
    Resolution: Consider revising to a more descriptive, single-sentence title such as "Add hybrid search combining semantic embeddings with BM25 keyword search" or "Implement HybridSearchService with Reciprocal Rank Fusion for semantic and keyword ranking."

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed
    Check skipped: CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 7, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-51: The test helper buildService creates a VectorStore without
a modelId, so index compatibility won't be checked; update buildService to pass
a modelId (e.g., use the EmbeddingService's modelId) into the VectorStore config
when constructing it: after creating embeddingService (EmbeddingService),
include modelId: embeddingService.modelId in the new VectorStore({...}) options
so VectorStore performs the same model compatibility/invalidation behavior as in
production before returning the HybridSearchService.

In `@src/ai/__tests__/vectorStore.test.ts`:
- Around line 183-195: The test "skips write when nothing changed" currently
only checks file contents and can pass even if the file was rewritten; change it
to assert that the file was not rewritten by importing stat from
node:fs/promises and using stat(join(tempDir, "test-index.json")) to capture
file metadata (e.g., mtimeMs or size) before the second await store.save(), call
store.save() again, then stat the file again and expect the metadata to be
unchanged; alternatively, spy on the write path used by the VectorStore
persistence (the function that writes the file) to assert it was not called on
the second save. Also add the stat import to the top-level imports.

In `@src/ai/embeddings.ts`:
- Around line 88-91: The embedSingle method currently returns results[0] without
verifying the array is non-empty; update embedSingle to defensively validate the
output of embed (call to this.embed([text])) — check that results is an array
with at least one element and that results[0] is defined, and if not throw a
clear error (or handle per project error conventions) so callers won't receive
undefined; modify the embedSingle function to perform this guard and surface a
descriptive error referencing embedSingle/embed.

In `@src/ai/hybridSearch.ts`:
- Around line 7-29: reciprocalRankFusion is currently writing RRF values into
SearchResult.score which breaks the documented semantic (cosine) score; update
the implementation to preserve the original cosine similarity and expose the
fusion ranking in a separate field or type (e.g., create a new
HybridSearchResult type with fields { chunk: TextChunk; score: number;
fusionScore: number } or add fusionScore to the returned objects), set
fusionScore to the computed rrfScore while keeping score as the original
item.score, and do the same change in HybridSearchService.search (the block
referenced by lines 65-94) so callers receive the original similarity in score
and the fusion metric in fusionScore or the new hybrid result type.
- Around line 37-56: Currently indexNote calls semanticSearch.indexNote(noteId,
content) which re-chunks internally while the hybrid layer separately calls
this.chunkFn(noteId, content) for keyword indexing; instead generate chunks once
(const chunks = this.chunkFn(noteId, content)) and pass that same chunks array
to both engines (e.g., change semanticSearch.indexNote to accept chunks or add a
new method like semanticSearch.indexChunks(noteId, chunks)) then call
semanticSearch.indexChunks(noteId, chunks) and keywordEngine.indexChunks(chunks)
after removing by noteId so both engines index identical chunk objects.

In `@src/ai/semanticSearch.ts`:
- Around line 42-46: The initialize() sequence can leave embeddingService
initialized if vectorStore.load() throws; wrap the vectorStore.load() call in a
try/catch and on failure perform a deterministic rollback of the embedding
service (call the appropriate cleanup/shutdown/dispose method on
embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.
- Around line 84-108: The hardcoded minScore 0.4 inside findRelatedNotes (used
in vectorStore.findRelated) should be extracted and documented: add a
configurable parameter or class-level constant (e.g., a new optional parameter
minScore: number = 0.4 on findRelatedNotes or a private readonly
DEFAULT_MIN_SCORE = 0.4 on the containing class) and replace the literal 0.4
with that identifier; update the method signature and call sites accordingly and
add a short comment explaining the threshold purpose and default.

In `@src/ai/vectorStore.ts`:
- Around line 123-124: Wrap the JSON.parse call that converts the file content
into a SerializedVectorStore in a try/catch to handle SyntaxError from
malformed/corrupted index files: around the lines using readFile(filePath,
"utf-8") and JSON.parse(raw) catch parsing errors, log or include the original
error details and throw a new, user-friendly error indicating the index file at
filePath is invalid and should be rebuilt (preserve the original error as cause
or append it to the message) so callers of the vector store load routine can
surface a clear rebuild instruction.
- Around line 151-157: The loaded index JSON needs structural validation before
populating this.entries: verify that the top-level data is an object and
data.entries is an array, then iterate and check each entry has entry.chunk &&
entry.chunk.id (string/number) and entry.vector (array/typed values) before
calling this.entries.set; for any invalid/missing fields either skip the entry
with a descriptive processLogger.warn or throw a clear Error that includes which
entry/index failed, so the loader (the method that currently clears this.entries
and loops over data.entries) fails with an actionable message instead of a
cryptic runtime error.

In `@vitest.config.ts`:
- Around line 3-14: Add explicit Vitest/Vite ESM handling for the
`@huggingface/transformers` package so tests that exercise real initialize() paths
(e.g., in embeddings.ts where initialize() may dynamic-import the library or
when embedFn is not provided) don't break under Node ESM resolution; update the
vitest.config.ts test config to include the package in server.deps.inline or
ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 399f4a0c-1dda-4f24-aa00-d754cf51318e

📥 Commits

Reviewing files that changed from the base of the PR and between a3ccb2b and 3b0db18.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (17)
  • .gitignore
  • package.json
  • src/ai/__tests__/embeddings.test.ts
  • src/ai/__tests__/hybridSearch.test.ts
  • src/ai/__tests__/keywordSearch.test.ts
  • src/ai/__tests__/semanticSearch.test.ts
  • src/ai/__tests__/vectorStore.test.ts
  • src/ai/embeddings.ts
  • src/ai/hybridSearch.ts
  • src/ai/index.ts
  • src/ai/keywordSearch.ts
  • src/ai/semanticSearch.ts
  • src/ai/types.ts
  • src/ai/vectorStore.ts
  • tsconfig.build.json
  • tsconfig.json
  • vitest.config.ts

Comment on lines +42 to +51
function buildService(embedFn?: EmbedFn) {
    const embeddingService = new EmbeddingService({}, embedFn ?? createMockEmbedFn());
    const vectorStore = new VectorStore({
        persistDir: "/tmp/hybrid-test",
        indexFilename: "test-index.json",
    });
    const semantic = new SemanticSearchService(embeddingService, vectorStore);
    const keyword = new KeywordSearchEngine();
    return new HybridSearchService(semantic, keyword);
}

🧹 Nitpick | 🔵 Trivial

Test buildService omits modelId from VectorStore config.

The VectorStore is created without modelId, which means it won't perform model compatibility checks during load. While acceptable for isolated tests, this deviates from production behavior where modelId enables index invalidation on model changes.

Consider adding modelId to align test configuration with production:

     const vectorStore = new VectorStore({
         persistDir: "/tmp/hybrid-test",
         indexFilename: "test-index.json",
+        modelId: "mock-model",
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/__tests__/hybridSearch.test.ts` around lines 42 - 51, The test helper
buildService creates a VectorStore without a modelId, so index compatibility
won't be checked; update buildService to pass a modelId (e.g., use the
EmbeddingService's modelId) into the VectorStore config when constructing it:
after creating embeddingService (EmbeddingService), include modelId:
embeddingService.modelId in the new VectorStore({...}) options so VectorStore
performs the same model compatibility/invalidation behavior as in production
before returning the HybridSearchService.

Comment on lines +42 to +46
/** Load embedding model + persisted index. Call before anything else. */
async initialize(): Promise<void> {
    await this.embeddingService.initialize();
    await this.vectorStore.load();
}

⚠️ Potential issue | 🟡 Minor

Potential inconsistent state if vectorStore.load() throws.

If vectorStore.load() throws (e.g., on model mismatch per VectorStore.load() at lines 145-148), embeddingService remains initialized but the vector store is not loaded. This creates asymmetric state and the error isn't caught for recovery.

Consider wrapping the initialization sequence or documenting the expected caller behavior:

🛡️ Proposed fix to handle vectorStore.load() errors
     async initialize(): Promise<void> {
         await this.embeddingService.initialize();
-        await this.vectorStore.load();
+        try {
+            await this.vectorStore.load();
+        } catch (error) {
+            // Log and rethrow - caller should handle index rebuild
+            console.error("VectorStore load failed:", error);
+            throw error;
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/semanticSearch.ts` around lines 42 - 46, The initialize() sequence can
leave embeddingService initialized if vectorStore.load() throws; wrap the
vectorStore.load() call in a try/catch and on failure perform a deterministic
rollback of the embedding service (call the appropriate cleanup/shutdown/dispose
method on embeddingService, e.g., embeddingService.shutdown() or
embeddingService.dispose()) before rethrowing the error so state remains
consistent; keep the calls inside the existing initialize() method and preserve
the original error propagation.

Comment on lines +151 to +157
this.entries.clear();
for (const entry of data.entries) {
    this.entries.set(entry.chunk.id, {
        chunk: entry.chunk,
        vector: entry.vector,
    });
}

🧹 Nitpick | 🔵 Trivial

Consider adding structural validation for loaded data.

If the index file contains valid JSON but an unexpected structure (e.g., entries is not an array, or entries lack required chunk.id properties), the code will fail with a cryptic error. A brief structural check would improve debuggability.

♻️ Optional defensive validation
+        if (!Array.isArray(data.entries)) {
+            throw new Error(
+                `VectorStore: Invalid index structure. Delete "${filePath}" to rebuild.`
+            );
+        }
+
         this.entries.clear();
         for (const entry of data.entries) {
             this.entries.set(entry.chunk.id, {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/vectorStore.ts` around lines 151 - 157, The loaded index JSON needs
structural validation before populating this.entries: verify that the top-level
data is an object and data.entries is an array, then iterate and check each
entry has entry.chunk && entry.chunk.id (string/number) and entry.vector
(array/typed values) before calling this.entries.set; for any invalid/missing
fields either skip the entry with a descriptive processLogger.warn or throw a
clear Error that includes which entry/index failed, so the loader (the method
that currently clears this.entries and loops over data.entries) fails with an
actionable message instead of a cryptic runtime error.

Comment on lines +3 to +14
export default defineConfig({
    test: {
        globals: true,
        environment: "node",
        include: ["src/**/__tests__/**/*.test.ts"],
        coverage: {
            provider: "v8",
            include: ["src/ai/**/*.ts"],
            exclude: ["src/ai/__tests__/**", "src/ai/index.ts"],
        },
    },
});

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

# Check package.json for `@huggingface/transformers` dependency
grep -i "huggingface\|transformers" package.json

Repository: AOSSIE-Org/SmartNotes

Length of output: 106


🏁 Script executed:

# Search for any imports of `@huggingface/transformers` in the codebase
rg "@huggingface/transformers|from.*transformers" --type ts --type tsx --type js --type json

Repository: AOSSIE-Org/SmartNotes

Length of output: 92


🏁 Script executed:

# Read the full vitest.config.ts to see current configuration
cat -n vitest.config.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 531


🏁 Script executed:

# Check if there are any embeddings-related files or tests
fd -e ts -e tsx | grep -i embed

Repository: AOSSIE-Org/SmartNotes

Length of output: 121


🏁 Script executed:

# Check for any existing ESM/SSR configuration in vitest config
rg "ssr|inline|external|deps" vitest.config.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 47


🏁 Script executed:

# Check embeddings.ts imports
cat -n src/ai/embeddings.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 4511


🏁 Script executed:

# Check embeddings.test.ts content
cat -n src/ai/__tests__/embeddings.test.ts

Repository: AOSSIE-Org/SmartNotes

Length of output: 6516


🏁 Script executed:

# Search for all references to `@huggingface` in the codebase
rg "@huggingface" --type ts --type js

Repository: AOSSIE-Org/SmartNotes

Length of output: 249


Consider adding explicit Vitest configuration for @huggingface/transformers ESM handling.

While @huggingface/transformers (^3.4.0) is a declared dependency and embeddings.ts uses dynamic import with lazy loading, the current Vitest config lacks ESM/external package handling. The test suite currently avoids triggering the real import by injecting mock implementations, so tests pass. However, if code paths exercise the actual initialize() method without a custom embedFn, the Node ESM environment may require explicit server.deps.inline or ssr.noExternal configuration to prevent module resolution issues. This is a precautionary measure to ensure compatibility when real-world code paths are exercised.
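Concretely, the suggested change might look like this; the shape is illustrative, so verify the exact option path against the Vitest documentation for the installed major version:

```typescript
// One possible shape for the suggested vitest.config.ts change.
// The server.deps.inline path is version-dependent; check the Vitest
// docs before adopting this as-is.
import { defineConfig } from "vitest/config";

export default defineConfig({
    test: {
        globals: true,
        environment: "node",
        include: ["src/**/__tests__/**/*.test.ts"],
        server: {
            deps: {
                // Force Vitest to inline this ESM package instead of
                // externalizing it to native Node module resolution.
                inline: ["@huggingface/transformers"],
            },
        },
    },
});
```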

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vitest.config.ts` around lines 3 - 14, Add explicit Vitest/Vite ESM handling
for the `@huggingface/transformers` package so tests that exercise real
initialize() paths (e.g., in embeddings.ts where initialize() may dynamic-import
the library or when embedFn is not provided) don't break under Node ESM
resolution; update the vitest.config.ts test config to include the package in
server.deps.inline or ssr.noExternal (or both) to force inlining/noExternal for
"@huggingface/transformers" so the module resolves correctly during tests and
runtime.

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 7, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/ai/hybridSearch.ts (1)

55-60: ⚠️ Potential issue | 🟠 Major

Index both engines from the same chunk array.

SemanticSearchService.indexNote() already re-chunks internally (src/ai/semanticSearch.ts:42-62), and this method chunks the raw content again for BM25. If those chunkers ever diverge, the same chunk ID can point to different text in each engine and RRF will fuse unrelated hits. Generate the chunks once here and feed that exact array to both indexes.
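The chunk-once fix could take roughly this shape; the `indexChunks` methods are hypothetical, and the PR would need to expose something equivalent on both services:

```typescript
// Sketch: chunk exactly once so both engines index identical chunk
// objects. Method names on the two services are hypothetical.
type TextChunk = { id: string; noteId: string; text: string };

async function indexNote(
    noteId: string,
    content: string,
    chunkFn: (noteId: string, content: string) => TextChunk[],
    semantic: { indexChunks(noteId: string, chunks: TextChunk[]): Promise<void> },
    keyword: { indexChunks(chunks: TextChunk[]): void },
): Promise<void> {
    // One chunking pass; both engines see the same IDs and text,
    // so RRF can never fuse mismatched chunks.
    const chunks = chunkFn(noteId, content);
    await semantic.indexChunks(noteId, chunks);
    keyword.indexChunks(chunks);
}
```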

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 55 - 60, The code currently chunks
content twice causing mismatched chunk IDs; call this.chunkFn(noteId, content)
once to produce a single chunks array and feed that same array to both engines.
Update the call site in hybridSearch.indexNote to use the single chunks array
and change SemanticSearchService.indexNote (or add a new
semanticSearch.indexChunks) to accept the precomputed chunks array instead of
raw content so it does not re-chunk internally; then pass chunks to
keywordEngine.indexChunks(chunks) and to the semantic search method, and remove
the internal re-chunking in SemanticSearchService.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai/__tests__/hybridSearch.test.ts`:
- Around line 42-47: buildService currently hardcodes VectorStore to
"/tmp/hybrid-test", making tests non-hermetic; change buildService to create a
unique temp directory per invocation (e.g., using fs.mkdtempSync or a UUID
appended to os.tmpdir()) and pass that path into the new VectorStore({
persistDir, indexFilename }) so each test gets its own directory, and ensure the
test suite's beforeEach()/afterEach() (or initialize()/teardown) removes or
cleans that temp dir after the test to avoid cross-test contamination.

In `@src/ai/hybridSearch.ts`:
- Around line 51-53: The initialize() method restores only the semantic store
(semanticSearch) but does not rebuild or restore the BM25 keyword index
(keywordEngine), so keywordEngine is empty after restart; update initialize() to
load persisted chunks/documents (the same source semanticSearch uses) and
rebuild or re-index keywordEngine from those persisted chunks, and update save()
to persist either the keywordEngine state or the chunks/documents used to
rebuild it (so keywordEngine can be reconstructed on startup); reference the
initialize() and save() methods, keywordEngine, semanticSearch and the persisted
chunks/documents used by the vector store when implementing this.
- Around line 76-87: The keyword results are being truncated by
KeywordSearchEngine.search before you apply the noteId filter, causing
note-scoped searches to miss matches; fix by applying noteId filtering inside
the keyword search path: either extend KeywordSearchEngine.search to accept an
optional noteId (and filter before slice) and call keywordEngine.search(query,
broadK, opts.noteId) here, or request the full/un-truncated set (e.g., topK =
Infinity or a large number) so you can filter by opts.noteId in this module
before slicing to broadK; update references to keywordEngine.search and ensure
filteredKeyword is produced from the pre-sliced results.

---

Duplicate comments:
In `@src/ai/hybridSearch.ts`:
- Around line 55-60: The code currently chunks content twice causing mismatched
chunk IDs; call this.chunkFn(noteId, content) once to produce a single chunks
array and feed that same array to both engines. Update the call site in
hybridSearch.indexNote to use the single chunks array and change
SemanticSearchService.indexNote (or add a new semanticSearch.indexChunks) to
accept the precomputed chunks array instead of raw content so it does not
re-chunk internally; then pass chunks to keywordEngine.indexChunks(chunks) and
to the semantic search method, and remove the internal re-chunking in
SemanticSearchService.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 05c80561-5fea-473f-a2bf-67aaf9d9ebff

📥 Commits

Reviewing files that changed from the base of the PR and between 3b0db18 and 3a26f06.

📒 Files selected for processing (4)
  • src/ai/__tests__/hybridSearch.test.ts
  • src/ai/hybridSearch.ts
  • src/ai/index.ts
  • src/ai/types.ts

@github-actions github-actions bot removed the size/XL label Mar 7, 2026
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 7, 2026

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/ai/hybridSearch.ts (2)

72-83: ⚠️ Potential issue | 🟠 Major

Filter keyword hits before truncating to topK.

KeywordSearchEngine.search() slices before returning, so filtering that reduced list at Line 81 can drop valid note-scoped matches whenever other notes occupy the first broadK BM25 slots.
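A minimal sketch of the fix, assuming the caller can obtain the un-truncated BM25 result set; the types here are illustrative, not the engine's real API:

```typescript
// Sketch: apply the noteId scope filter before truncating to topK,
// so out-of-scope hits cannot crowd out in-scope matches.
type KeywordHit = { noteId: string; score: number };

function scopedKeywordSearch(
    allHits: KeywordHit[], // the full, un-truncated BM25 result set
    topK: number,
    noteId?: string,
): KeywordHit[] {
    // Filter first, slice second.
    const scoped = noteId ? allHits.filter((h) => h.noteId === noteId) : allHits;
    return scoped.slice(0, topK);
}
```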

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 72 - 83, Keyword search results are
pre-sliced by KeywordSearchEngine.search, so filtering by opts.noteId after that
(into filteredKeyword) can drop valid note-scoped hits; change the flow to fetch
untruncated keyword hits (e.g., expose/use a searchAll / unsliced variant or
call keywordEngine.search with a sufficiently large K) then apply the noteId
filter (opts.noteId) to that full set and only after filtering truncate to
broadK (used alongside semanticResults); update usages around
keywordEngine.search, filteredKeyword, broadK, and opts.noteId accordingly.

47-49: ⚠️ Potential issue | 🟠 Major

Persist or rebuild the BM25 index on startup.

initialize() restores only the semantic store, and save() persists only that same state. After a restart, keywordEngine is empty, so hybrid search silently degrades to semantic-only until every note is reindexed.

Also applies to: 96-98
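One way to implement the rebuild-on-startup option; names are hypothetical, and it assumes the semantic store can enumerate its persisted chunks:

```typescript
// Sketch: rebuild the BM25 index from the chunks the vector store
// already persists, instead of serializing BM25 state separately.
type TextChunk = { id: string; text: string };

async function initializeHybrid(
    semantic: { initialize(): Promise<void>; allChunks(): TextChunk[] },
    keyword: { indexChunks(chunks: TextChunk[]): void },
): Promise<void> {
    await semantic.initialize();
    // The semantic store is the source of truth; deriving the keyword
    // index from it keeps the two engines consistent after a restart.
    keyword.indexChunks(semantic.allChunks());
}
```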

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/hybridSearch.ts` around lines 47 - 49, initialize() currently only
restores the semantic store so keywordEngine/BM25 is empty after restart; update
initialize() (and the save() path referenced near lines 96-98) to persist and
restore the BM25/keywordEngine state or trigger a rebuild: add serialization of
the BM25 index when save() runs and load/deserializer logic in initialize() to
repopulate keywordEngine, and if no serialized snapshot exists call the existing
reindex routine (e.g., reindexAll or the method that iterates notes) to rebuild
the BM25 index on startup; ensure you reference and update keywordEngine,
initialize(), save(), and the BM25 serialization/rebuild helpers so hybridSearch
is fully restored after restart.
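The restore-or-rebuild idea can be sketched like this. `KeywordEngineStub`, `restoreKeywordIndex`, and the snapshot shape are hypothetical stand-ins for the real `HybridSearchService` internals, shown only to illustrate the fallback logic.

```typescript
// Minimal stand-in for the BM25 engine: just enough to show reindexing.
interface PersistedChunk {
    noteId: string;
    content: string;
}

class KeywordEngineStub {
    private docs = new Map<string, string[]>();

    indexNote(noteId: string, content: string): void {
        const terms = content.toLowerCase().split(/\W+/).filter(Boolean);
        this.docs.set(noteId, terms);
    }

    get size(): number {
        return this.docs.size;
    }
}

function restoreKeywordIndex(
    engine: KeywordEngineStub,
    snapshot: PersistedChunk[] | null, // serialized BM25 docs, if a snapshot exists
    persistedChunks: PersistedChunk[], // chunk data already persisted for semantic search
): void {
    // Prefer a serialized snapshot; otherwise rebuild from the chunk store so
    // hybrid search does not silently degrade to semantic-only after restart.
    const source = snapshot ?? persistedChunks;
    for (const chunk of source) {
        engine.indexNote(chunk.noteId, chunk.content);
    }
}
```

Calling this from `initialize()` (and serializing the engine from `save()`) would keep the keyword side in lockstep with the semantic store across restarts.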

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bb239b31-74cd-431f-891e-2c3a8539b8d7

📥 Commits

Reviewing files that changed from the base of the PR and between 3a26f06 and dbcab75.

📒 Files selected for processing (2)
  • src/ai/hybridSearch.ts
  • src/ai/semanticSearch.ts

@github-actions github-actions bot added size/XL and removed size/XL labels Mar 7, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/ai/vectorStore.ts (1)

149-165: ⚠️ Potential issue | 🟠 Major

Validate every loaded row before rebuilding the map.

This path only proves that the JSON parses and that the first entry matches vectorDim. Valid JSON with the wrong shape can still fail with opaque TypeErrors at Line 149 or Line 161, and a later mixed-dimension row slips through until search time. Validate entries up front and check each row's chunk fields plus vector length before populating this.entries.

Proposed fix
+        if (!Array.isArray(data.entries)) {
+            throw new Error(
+                `VectorStore: invalid index structure in "${filePath}". ` +
+                `Delete the file to rebuild the index.`,
+            );
+        }
+
-        if (data.vectorDim !== undefined && data.entries.length > 0) {
-            const actualDim = data.entries[0].vector.length;
-            if (actualDim !== data.vectorDim) {
-                throw new Error(
-                    `VectorStore: index vectorDim mismatch ` +
-                    `(stored ${data.vectorDim}, found ${actualDim}). Index may be corrupted.`,
-                );
-            }
-        }
-
         this.entries.clear();
-        for (const entry of data.entries) {
+        for (const [index, entry] of data.entries.entries()) {
+            if (
+                !entry ||
+                !entry.chunk ||
+                typeof entry.chunk.id !== "string" ||
+                typeof entry.chunk.noteId !== "string" ||
+                typeof entry.chunk.content !== "string" ||
+                typeof entry.chunk.chunkIndex !== "number" ||
+                !Array.isArray(entry.vector)
+            ) {
+                throw new Error(
+                    `VectorStore: invalid entry at index ${index} in "${filePath}". ` +
+                    `Delete the file to rebuild the index.`,
+                );
+            }
+            if (
+                data.vectorDim !== undefined &&
+                entry.vector.length !== data.vectorDim
+            ) {
+                throw new Error(
+                    `VectorStore: invalid vector dimension at entry ${index} in "${filePath}".`,
+                );
+            }
             this.entries.set(entry.chunk.id, {
                 chunk: entry.chunk,
                 vector: entry.vector,
             });
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/vectorStore.ts` around lines 149 - 165, Validate the loaded payload
before rebuilding this.entries: confirm data.entries is an array and iterate
each entry to ensure it has a chunk object with required fields (at least
chunk.id) and a numeric vector array whose length matches data.vectorDim (when
data.vectorDim is defined) and all elements are numbers; throw a clear Error
identifying the offending entry if any check fails; only after all entries pass
validation, clear this.entries and populate it using entry.chunk.id -> { chunk:
entry.chunk, vector: entry.vector } so malformed rows never get inserted
(referencing data.entries, data.vectorDim, entry.chunk, entry.vector, and
this.entries).
src/ai/__tests__/vectorStore.test.ts (1)

188-199: ⚠️ Potential issue | 🟡 Minor

Assert the no-op, not just the final contents.

Line 191 says the second save() should be a no-op, but this test still passes if save() rewrites the same JSON. Capture file metadata before/after the second call, or spy on the write path, so the test proves the optimization instead of only re-reading the final file contents.

Proposed fix
-import { mkdtemp, rm, readFile, writeFile } from "node:fs/promises";
+import { mkdtemp, rm, readFile, writeFile, stat } from "node:fs/promises";
...
         it("skips write when nothing changed", async () => {
             store.add([makeChunk("n1", 0, [1, 0, 0])]);
             await store.save();
+            const before = await stat(join(tempDir, "test-index.json"));
             await store.save(); // second save should be no-op
+            const after = await stat(join(tempDir, "test-index.json"));

             const content = await readFile(
                 join(tempDir, "test-index.json"),
                 "utf-8",
             );
             const data = JSON.parse(content);
+            expect(after.mtimeMs).toBe(before.mtimeMs);
             expect(data.version).toBe(1);
             expect(data.entries).toHaveLength(1);
         });

As per coding guidelines, **/*.test.{ts,tsx,js,jsx}: The tests are not tautological.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/__tests__/vectorStore.test.ts` around lines 188 - 199, The test
currently only verifies final file contents but must assert the second
store.save() is a true no-op; modify the "skips write when nothing changed" test
to capture file metadata or spy the write path before the second save (e.g.,
stat the file mtime or spy on fs.writeFile/Store's internal write method used by
save()), call await store.save() the second time, then assert that no write
occurred (mtime unchanged or spy not called) in addition to the existing content
checks; locate the logic around store.save() and the file read (readFile/join
usage) to add the stat/spy and the assertion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai/vectorStore.ts`:
- Around line 28-33: The add(embeddedChunks: EmbeddedChunk[]) method currently
mutates this.entries before validating embedding sizes, allowing mixed-dimension
vectors to be stored; update add to first determine the expected vector length
(use an existing this.dimension property or set it from the first inserted
ec.embedding.length if missing) and validate every ec.embedding.length against
that expected dimension, throwing an error and refusing to mutate this.entries
if any mismatch is found; ensure you set this.dimension on the first successful
insert and only mark this.dirty = true after all entries have been validated and
inserted.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f46da6c2-eb49-4565-8c70-ea5406ae89a6

📥 Commits

Reviewing files that changed from the base of the PR and between dbcab75 and 021baf9.

📒 Files selected for processing (2)
  • src/ai/__tests__/vectorStore.test.ts
  • src/ai/vectorStore.ts

@sharma-sugurthi

@dorodb-web22 Really solid first implementation: BM25 + RRF + local embeddings in pure TypeScript with 77 tests is impressive work.

A few things I noticed:

Blocking concerns:

  • BM25 state is lost on restart (hybridSearch.ts initialize()). After an app restart, keywordEngine comes back empty while semanticSearch restores from disk, so hybrid search silently falls back to semantic-only. This needs either BM25 persistence or an index rebuild from persisted chunk data on startup.

  • The noteId filter is applied after topK truncation in keywordEngine.search(). If other notes dominate the top-K BM25 results, the per-note filter returns nothing even when matches exist. The filter needs to happen before the slice.

  • Vector dimensions are not validated in VectorStore.add(). A mismatched vector silently enters the store and poisons all future queries. A cheap dimension check on insert is worth it.

  • Minor: tests in hybridSearch.test.ts share a fixed /tmp/hybrid-test path, which makes the suite non-hermetic across runs. A per-test temp dir (via mkdtemp) would fix it.
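The mkdtemp suggestion amounts to something like the hooks below. These are illustrative setup/teardown functions, not the actual hybridSearch.test.ts code.

```typescript
import { mkdtemp, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

let tempDir: string;

// beforeEach-style hook: a unique directory per test instead of a shared
// fixed /tmp/hybrid-test path, so parallel or repeated runs never collide.
async function setupTempDir(): Promise<string> {
    tempDir = await mkdtemp(join(tmpdir(), "hybrid-test-"));
    return tempDir;
}

// afterEach-style hook: remove the directory so nothing leaks between runs.
async function cleanupTempDir(): Promise<void> {
    await rm(tempDir, { recursive: true, force: true });
}
```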

The per-PR fixes you have been pushing (chunker reuse, error descriptiveness, fusionScore) are exactly the right things to do. The three blocking items above are the main things I would want to see resolved before merge.

Happy to discuss any of them.
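For readers unfamiliar with the fusion step discussed in this PR, standard Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over every ranked list it appears in, with k = 60. The function name and result shape below are illustrative, not the actual HybridSearchService API.

```typescript
// Standard RRF: score(d) = sum over lists i of 1 / (k + rank_i(d)), k = 60.
// Documents ranked well by either the semantic or the keyword list rise to
// the top without any hand-tuned weights.
function reciprocalRankFusion(
    rankings: string[][], // each inner array is an id list, best-first
    k = 60,
): Map<string, number> {
    const scores = new Map<string, number>();
    for (const ranking of rankings) {
        ranking.forEach((id, index) => {
            const rank = index + 1; // 1-based rank within this list
            scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
        });
    }
    return scores;
}
```

A document appearing near the top of both lists accumulates two contributions and outranks one that appears in only a single list.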


@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
src/ai/vectorStore.ts (1)

164-179: ⚠️ Potential issue | 🟠 Major

Validate every loaded entry before rebuilding the index.

At Line 164-Line 179, load() still trusts persisted JSON shape and only checks dimension consistency using the first vector. A malformed/corrupted file can still inject invalid entries or mixed dimensions, causing downstream runtime failures during search.

🔧 Proposed hardening for load-time validation
-        if (data.vectorDim !== undefined && data.entries.length > 0) {
-            const actualDim = data.entries[0].vector.length;
-            if (actualDim !== data.vectorDim) {
-                throw new Error(
-                    `VectorStore: index vectorDim mismatch ` +
-                    `(stored ${data.vectorDim}, found ${actualDim}). Index may be corrupted.`,
-                );
-            }
-        }
-
-        this.entries.clear();
-        for (const entry of data.entries) {
-            this.entries.set(entry.chunk.id, {
-                chunk: entry.chunk,
-                vector: entry.vector,
-            });
-        }
+        if (!data || !Array.isArray(data.entries)) {
+            throw new Error(
+                `VectorStore: invalid index structure in "${filePath}". Delete it to rebuild.`,
+            );
+        }
+
+        const expectedDim =
+            data.vectorDim ??
+            (data.entries.length > 0 ? data.entries[0].vector.length : undefined);
+
+        this.entries.clear();
+        for (let i = 0; i < data.entries.length; i++) {
+            const entry = data.entries[i];
+            if (
+                !entry ||
+                !entry.chunk ||
+                typeof entry.chunk.id !== "string" ||
+                typeof entry.chunk.noteId !== "string" ||
+                !Array.isArray(entry.vector)
+            ) {
+                throw new Error(
+                    `VectorStore: invalid entry at index ${i} in "${filePath}". Delete it to rebuild.`,
+                );
+            }
+
+            if (expectedDim !== undefined && entry.vector.length !== expectedDim) {
+                throw new Error(
+                    `VectorStore: inconsistent vector dimensions in "${filePath}" ` +
+                    `(expected ${expectedDim}, found ${entry.vector.length} at entry ${i}).`,
+                );
+            }
+
+            this.entries.set(entry.chunk.id, {
+                chunk: entry.chunk,
+                vector: entry.vector,
+            });
+        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/vectorStore.ts` around lines 164 - 179, The load() path currently only
checks vectorDim against the first entry and then trusts data.entries; instead,
validate every entry before rebuilding this.entries: iterate over data.entries
and for each entry confirm entry.chunk and entry.chunk.id exist, entry.vector is
an array of numbers, and entry.vector.length matches data.vectorDim (or infer
and set vectorDim if undefined) and throw a descriptive Error on mismatch; only
after all entries pass validation populate this.entries (used in the for loop
that calls this.entries.set) to avoid injecting malformed or mixed-dimension
vectors into the index.
src/ai/__tests__/vectorStore.test.ts (1)

188-200: ⚠️ Potential issue | 🟡 Minor

Make the no-op save test prove no second write happened.

Line 188-Line 200 currently validates content only, so it still passes if save() rewrites identical JSON on the second call.

✅ Proposed test tightening
-import { mkdtemp, rm, readFile, writeFile } from "node:fs/promises";
+import { mkdtemp, rm, readFile, writeFile, stat } from "node:fs/promises";
@@
         it("skips write when nothing changed", async () => {
             store.add([makeChunk("n1", 0, [1, 0, 0])]);
             await store.save();
+            const before = await stat(join(tempDir, "test-index.json"));
             await store.save(); // second save should be no-op
+            const after = await stat(join(tempDir, "test-index.json"));
 
             const content = await readFile(
                 join(tempDir, "test-index.json"),
                 "utf-8",
             );
             const data = JSON.parse(content);
+            expect(after.mtimeMs).toBe(before.mtimeMs);
             expect(data.version).toBe(1);
             expect(data.entries).toHaveLength(1);
         });

As per coding guidelines, **/*.test.{ts,tsx,js,jsx}: The tests are not tautological.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai/__tests__/vectorStore.test.ts` around lines 188 - 200, The test "skips
write when nothing changed" currently only asserts file contents, so it can pass
if save() rewrites identical JSON; update the test to prove the second save did
not perform a write by either (a) spying on the file write API used by
VectorStore (e.g. jest.spyOn(fs.promises, "writeFile") or the actual writer used
by store.save) and asserting it was called exactly once, or (b) recording the
file mtime via fs.promises.stat on "test-index.json" before the second await
store.save() and asserting the mtime did not change after the second save;
reference the test name, store.save, and the "test-index.json" file when adding
the spy/stat checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b588922c-3618-4ad6-bff6-a50b6d1151d4

📥 Commits

Reviewing files that changed from the base of the PR and between 021baf9 and 2dd32ec.

📒 Files selected for processing (2)
  • src/ai/__tests__/vectorStore.test.ts
  • src/ai/vectorStore.ts
